Abstract:
The predominant user activities on mobile architectures (e.g., smartphones) involve entering text in instant messaging apps, short message services, and social networking services. Recent research reveals that the normal use of input methods drains approximately half of the battery capacity because of their above-average power requirements and frequent use. In this paper, we first study the power characteristics of mobile input methods and find that they consistently over-provision resources to satisfy users. For example, the available psychophysical evidence indicates that when spell-checking features respond within a certain time threshold, users feel they are getting instant feedback. However, current systems respond much faster than necessary; the extra speed is imperceptible to users and costly in terms of energy use. Given this over-provisioning, the system can be slowed down to save energy while retaining the feeling of instant response. Inspired by this observation, we also exploit several other psychophysical facts to identify the exact criteria for satisfying users. As a result, we present a user experience-oriented technology, utexia, to optimize the energy use of mobile input methods. The evaluation shows that utexia conserves up to 42.9% in energy use while strictly ensuring a good user experience.
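The idea of slowing the system down until the response time just meets a perceptual threshold can be sketched as a frequency-selection problem. The following is a minimal illustration, not utexia's actual mechanism: the frequency table, the power numbers, the linear latency model, and the 100 ms threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqLevel:
    mhz: int         # clock frequency in MHz (hypothetical)
    power_mw: float  # active power at this level in mW (hypothetical)


# Hypothetical frequency/power table; real devices expose their own levels.
LEVELS = [
    FreqLevel(300, 150.0),
    FreqLevel(600, 400.0),
    FreqLevel(1200, 1100.0),
    FreqLevel(1800, 2200.0),
]


def latency_ms(work_mcycles: float, level: FreqLevel) -> float:
    """Time to execute `work_mcycles` million cycles at the given frequency."""
    return work_mcycles * 1000.0 / level.mhz


def energy_mj(work_mcycles: float, level: FreqLevel) -> float:
    """Energy in mJ: power (mW) times latency (ms) gives uJ, then scale to mJ."""
    return level.power_mw * latency_ms(work_mcycles, level) / 1000.0


def pick_level(work_mcycles: float, levels, threshold_ms: float) -> FreqLevel:
    """Pick the lowest-energy level whose latency stays within the threshold.

    If no level is fast enough, fall back to the fastest one.
    """
    feasible = [l for l in levels if latency_ms(work_mcycles, l) <= threshold_ms]
    if not feasible:
        return max(levels, key=lambda l: l.mhz)
    return min(feasible, key=lambda l: energy_mj(work_mcycles, l))


if __name__ == "__main__":
    # A spell-check task of 30 million cycles with an assumed 100 ms threshold.
    chosen = pick_level(30, LEVELS, threshold_ms=100.0)
    print(chosen, energy_mj(30, chosen))
```

With these made-up numbers, the slowest level still meets the assumed threshold, so it is selected; tightening the threshold forces a faster and more energy-hungry level.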
Abstract:
Convolutional Neural Networks (CNNs) are very computation-intensive. Recently, many CNN accelerators that exploit the intrinsic parallelism of CNNs have been proposed. However, we observe a large mismatch between the types of parallelism supported by the computing engine and the dominant types of parallelism in CNN workloads. This mismatch seriously degrades the resource utilization of existing accelerators. In this paper, we propose a flexible dataflow architecture (FlexFlow) that can leverage the complementary effects among feature map, neuron, and synapse parallelism to mitigate the mismatch. Evaluated on six typical practical workloads, our design achieves a 2-10x performance speedup and a 2.5-10x power efficiency improvement compared with three state-of-the-art accelerator architectures. Moreover, FlexFlow is highly scalable as the computing engine grows.
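The mismatch described above can be made concrete with a toy utilization model. This is only a sketch under stated assumptions, not the paper's evaluation methodology: it assumes a rigid engine whose processing elements are unrolled by fixed factors along three loop dimensions (feature maps, neurons within a feature map, and synapses within a kernel), and the function name `utilization` and the example layer sizes are hypothetical.

```python
import math


def utilization(layer_dims, unroll):
    """Fraction of processing elements doing useful work.

    layer_dims: (feature_maps, neurons, synapses) of a convolutional layer.
    unroll:     fixed unrolling factors of the engine along the same dimensions.
    A dimension of size d mapped onto t lanes needs ceil(d / t) passes,
    so only d out of ceil(d / t) * t lane-slots carry useful work.
    """
    u = 1.0
    for d, t in zip(layer_dims, unroll):
        u *= d / (math.ceil(d / t) * t)
    return u


if __name__ == "__main__":
    # Hypothetical engine: 16 feature maps x 1 neuron x 25 synapses (5x5 kernel).
    engine = (16, 1, 25)
    # A layer with only 6 feature maps leaves 10 of the 16 lanes idle.
    print(utilization((6, 784, 25), engine))
    # A layer with 3x3 kernels uses only 9 of the 25 synapse lanes.
    print(utilization((16, 100, 9), engine))
```

In this model, an engine fixed to one style of parallelism is well utilized only by layers whose shape happens to match its unrolling factors, which is the kind of mismatch a flexible dataflow aims to avoid.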
Abstract:
Process, Voltage, and Temperature (PVT) variations can significantly degrade the performance benefits expected from next-generation nanoscale technology. The primary circuit-level implication of PVT variations is the resulting timing emergencies. In a multi-core processor running multiple programs, variations create spatial and temporal imbalance across the processing cores. Most prior schemes are dedicated to tolerating PVT variations individually for a single core, but ignore the opportunity to leverage the complementary effects between variations and the intrinsic variation imbalance among individual cores. We find that the delay impacts of different variations do not necessarily accumulate. Cores with mild variations can share the heavy workload of cores suffering large variations. If managed correctly, variations on different cores can help mitigate each other and result in a variation-mild environment. In this paper, we propose Timing Emergency Aware Thread Migration (TEA-TM), a delay sensor-based scheme to reduce system timing emergencies under PVT variations. Fourier transform and frequency-domain analysis are conducted to provide insight into the potential of the PVT co-optimization scheme. Experimental results show that, on average, TEA-TM can save up to 24% of throughput loss while improving system fairness by 85%.
Abstract:
High-resolution (HR) videos have become popular due to the widespread adoption of high-definition displays. Super-resolution (SR) techniques aim to recover HR frames from low-resolution (LR) frames. Recently, deep neural network (DNN)-based SR methods have achieved superior quality compared to traditional methods, and FPGA-based SR accelerators have been proposed to optimize performance and power efficiency. However, most accelerators tailored for video SR only accept uncompressed video frames and perform DNN inference on every frame, ignoring the temporal-spatial information in compressed video bitstreams. In contrast, we observe that non-key frames can be directly constructed using codec information and HR key frames, saving a significant amount of DNN computation. In this paper, we propose a novel compressed video SR flow and a dedicated FPGA accelerator called Co-ViSu that integrates decoder, SR, and encoder engines. Co-ViSu exploits a codec information reuse scheme to skip non-key frame decoding, avoid complex DNN computation, and speed up encoding. Our experimental results show that Co-ViSu achieves 3.6x to 9.4x performance and 4.2x energy efficiency gains with only 0.17 dB quality loss compared to the traditional flow, and 2.1x the throughput of the state of the art.
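One way to picture constructing a non-key frame from codec information and an HR key frame is block-wise motion compensation with scaled motion vectors. The sketch below is an assumption about how such reuse could work, not the Co-ViSu pipeline: it ignores residuals and sub-pixel motion, and the function name `reconstruct_non_key`, the motion-vector layout, and the clamping at frame borders are all hypothetical.

```python
def reconstruct_non_key(hr_key, motion_vectors, lr_block, scale):
    """Build an HR non-key frame by copying displaced pixels from an HR key frame.

    hr_key:         2D list of pixel values (the super-resolved key frame).
    motion_vectors: {(block_row, block_col): (dy, dx)} in LR pixels per LR block;
                    blocks without an entry are treated as static.
    lr_block:       block size in LR pixels.
    scale:          SR upscaling factor; block size and vectors are scaled by it.
    """
    h, w = len(hr_key), len(hr_key[0])
    hr_block = lr_block * scale
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = motion_vectors.get((y // hr_block, x // hr_block), (0, 0))
            # Scale the LR displacement to HR coordinates and clamp to the frame.
            sy = min(max(y + dy * scale, 0), h - 1)
            sx = min(max(x + dx * scale, 0), w - 1)
            out[y][x] = hr_key[sy][sx]
    return out


if __name__ == "__main__":
    key = [[y * 4 + x for x in range(4)] for y in range(4)]
    # The top-left LR block moved one LR pixel to the right; others are static.
    frame = reconstruct_non_key(key, {(0, 0): (0, 1)}, lr_block=1, scale=2)
    for row in frame:
        print(row)
```

No DNN inference is needed for the non-key frame in this sketch, which is the source of the computation savings the abstract refers to.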
Abstract:
The Unspent Transaction Output (UTXO) set is the foundational model used in many blockchain systems to represent assets. The benefits of the UTXO representation include parallel processing and privacy, among others. However, the increasing size of the UTXO set degrades access performance and, in turn, severely slows blockchain validation, especially in resource-constrained scenarios such as IoT. In this paper, we present a memory-economical storage system for UTXO-based blockchains. Based on the inherent properties of the UTXO set, we propose two lossless compression techniques to reduce the memory space it occupies. In addition, database-related operations are adopted so that the proposed mechanism can be easily applied to current blockchain systems. Taking Bitcoin as the object of study, our mechanism delivers a 2.9-4.5x memory reduction and orders-of-magnitude improvement in validation speed in resource-constrained situations. This compact system will improve validation performance and extend the range of applications of blockchains.
Abstract:
Inference efficiency is the predominant consideration in designing deep learning accelerators. Previous work mainly focuses on skipping zero values to deal with the considerable amount of ineffectual computation, while zero bits in non-zero values, another major source of ineffectual computation, are often ignored. The reason lies in the difficulty of extracting essential bits while performing multiply-and-accumulate (MAC) operations in the processing element. Based on the fact that zero bits account for as much as 68.9% of the overall weights of modern deep convolutional neural network models, this paper first proposes a weight kneading technique that can simultaneously eliminate ineffectual computation caused by both zero-value weights and zero bits in non-zero weights. In addition, a split-and-accumulate (SAC) computing pattern that replaces the conventional MAC, together with a corresponding hardware accelerator design called Tetris, is proposed to support weight kneading at the hardware level. Experimental results show that Tetris can speed up inference by up to 1.50x and improve power efficiency by up to 5.33x compared with state-of-the-art baselines.
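The intuition behind replacing MAC with a split-and-accumulate pattern can be illustrated in software. The sketch below is a simplified reading, not the Tetris hardware or its weight kneading scheme: it assumes unsigned fixed-point weights, and the function names `zero_bit_fraction` and `sac_dot` are hypothetical. Each weight is split into its essential (1) bits, activations are accumulated per bit position, and a single shift-and-add at the end produces the dot product, so zero bits and zero weights contribute no work.

```python
def zero_bit_fraction(weights, bits=8):
    """Fraction of zero bits across all weights (unsigned, `bits` wide)."""
    total = len(weights) * bits
    ones = sum(bin(w).count("1") for w in weights)
    return 1 - ones / total


def sac_dot(weights, activations, bits=8):
    """Dot product computed by splitting weights into bits and accumulating.

    For every essential bit at position b of a weight, the matching activation
    is added to the accumulator for position b. Zero bits are skipped entirely.
    """
    segments = [0] * bits
    for w, a in zip(weights, activations):
        for b in range(bits):
            if (w >> b) & 1:
                segments[b] += a
    # One final shift-and-add pass weights each accumulator by 2**b.
    return sum(s << b for b, s in enumerate(segments))


if __name__ == "__main__":
    w = [0, 3, 8, 129]
    a = [5, 7, 2, 1]
    print(zero_bit_fraction(w))
    print(sac_dot(w, a), sum(x * y for x, y in zip(w, a)))
```

In this example only five additions into the per-bit accumulators are needed, one per essential bit, instead of four full multiplications.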